List of AI News about continuous visual tokens
| Time | Details |
|---|---|
| 2025-11-26 11:09 | **Chain-of-Visual-Thought (COVT): Revolutionizing Visual Language Models with Continuous Visual Tokens for Enhanced Perception**<br>According to @godofprompt, the research paper "Chain-of-Visual-Thought (COVT)" introduces a method that lets Visual Language Models (VLMs) reason with continuous visual tokens instead of traditional text-based chains of thought. Mid-generation, the model emits visual latents such as segmentation cues, depth maps, edges, and DINO features, effectively giving it a "visual scratchpad" for spatial and geometric reasoning (a conceptual sketch follows the table). The reported results are significant: COVT achieved a 14% improvement in depth reasoning, a 5.5% gain on CV-Bench, and major gains on the HRBench and MMVP benchmarks. The technique is compatible with leading VLMs such as Qwen2.5-VL and LLaVA, and the visual tokens can be decoded back into images for transparency. Notably, the paper finds that text-only reasoning chains actually degrade visual reasoning performance, whereas COVT's visual grounding improves counting, spatial understanding, and 3D awareness while reducing hallucinated outputs. These findings point to business opportunities for AI solutions requiring fine-grained visual analysis, accurate object recognition, and reliable spatial intelligence, especially in fields like robotics, autonomous vehicles, and advanced multimodal search. (Source: @godofprompt, Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens, 2025) |